Search for: All records

Creators/Authors contains: "Shahabaz, Ahmed"


  1. The joint analysis of audio and video is a powerful tool with applications in many contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in different ways to achieve the desired result. This paper reviews the literature in this area, discussing the history, advancements, and datasets of audio-visual learning methods across application domains. It also presents an overview of reported performance on standard datasets and suggests promising directions for future research.
  2. Advances in visual perceptual tasks have been driven mainly by the amount and types of annotations in large-scale datasets. Researchers have focused on fully supervised settings, training models with offline, epoch-based schemes. Despite the evident advancements, the limitations and cost of manually annotated datasets have hindered further development of event perceptual tasks, such as the detection and localization of objects and events in videos. The problem is more apparent in zoological applications because of the scarcity of annotations and the length of the videos; most annotated videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework that tackles temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming by detecting and localizing events without the need for labels. Our approach is trained online on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long, realistic datasets that include real-world challenges, we introduce a new wildlife video dataset, nest monitoring of the Kagu (a flightless bird from New Caledonia), to benchmark our approach. Our dataset features 10 days (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels; each frame is also annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms self-supervised, traditional (e.g., optical flow, background subtraction), and NN-based (e.g., PA-DPC, DINO, iBOT) baselines, and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best-performing model detects one false positive activity every 50 minutes of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. We also show that our approach is robust to various environmental conditions (e.g., moving shadows), and we benchmark the framework on datasets from other domains (i.e., Kinetics-GEBD, TAPOS) to demonstrate its generalizability. The data and code are available on our project page: https://aix.eng.usf.edu/research_automated_ethogramming.html
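The core mechanism this abstract describes — predicting the next frame's features in a single online pass and flagging frames whose prediction error spikes as event boundaries — can be illustrated with a deliberately simplified sketch. This is not the authors' LSTM-and-attention model: the learned predictor is replaced here by a running-average feature estimate, the error gate by a running-mean threshold, and `lr` and `threshold` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_event_boundaries(features, lr=0.1, threshold=2.0):
    """Single-pass, label-free boundary detection sketch.

    A running estimate of the feature vector stands in for the learned
    predictor; frames whose prediction error exceeds `threshold` times
    the running mean error are flagged as event boundaries.
    """
    pred = features[0].copy()       # current "prediction" of the next frame
    mean_err = 1e-6                 # running average of prediction error
    boundaries = []
    for t, f in enumerate(features[1:], start=1):
        err = np.linalg.norm(f - pred)
        if err > threshold * mean_err:
            boundaries.append(t)    # surprise spike -> candidate boundary
        mean_err = 0.9 * mean_err + 0.1 * err
        pred += lr * (f - pred)     # online update toward the new input
    return boundaries

# Synthetic stream: two "events" with distinct mean feature vectors,
# so a true boundary sits at frame 50.
a = rng.normal(0.0, 0.05, size=(50, 16))
b = rng.normal(5.0, 0.05, size=(50, 16))
feats = np.vstack([a, b])
print(online_event_boundaries(feats))
```

The same single-pass structure is what lets the full framework run on streaming input with no separate training set; the real system replaces the running average with an LSTM over backbone features and adds the attention filter for spatial localization.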
  3. Our dataset, Nest Monitoring of the Kagu, consists of around ten days (253 hours) of continuous monitoring sampled at 25 frames per second. The dataset aims to facilitate computer vision research on event detection and localization. We fully annotated the entire dataset (23M frames) with spatial localization labels in the form of tight bounding boxes. Additionally, we provide temporal event segmentation labels for five unique bird activities: Feeding, Pushing leaves, Throwing leaves, Walk-In, and Walk-Out. The feeding event represents the period of time when the birds feed the chick. The nest-building events (pushing/throwing leaves) occur when the birds work on the nest during incubation. Pushing leaves is a nest-building behavior during which the birds form a crater by pushing leaves with their legs toward the edges of the nest while sitting on it. Throwing leaves is another nest-building behavior during which the birds throw leaves with the bill toward the nest while being, most of the time, outside the nest. Walk-In and Walk-Out events represent the transitions from an empty nest to incubation or brooding, and vice versa. We also provide five additional labels based on time-of-day and lighting conditions: Day, Night, Sunrise, Sunset, and Shadows. In our manuscript, we provide a baseline approach that detects events and spatially localizes the bird in each frame using an attention mechanism. Our approach does not require any labels and uses a predictive deep learning architecture inspired by cognitive psychology studies, specifically Event Segmentation Theory (EST). We split the dataset such that the first two days are used for validation, and performance evaluation is done on the last eight days.
The video monitoring system consisted of a commercial infrared illuminator surveillance camera (Sony 1/3′′ CCD image sensor) and an Electret mini microphone with built-in SMD amplifier (Henri Electronic, Germany), connected to a recording device via a 6.4-mm multicore cable. The transmission cable consisted of a 3-mm coaxial cable for the video signal, a 2.2-mm coaxial cable for the audio signal, and two 2-mm (0.75 mm2) cables to power the camera and microphone. We powered the systems with 25-kg deep-cycle lead-acid batteries with a storage capacity of 100 Ah. We used both Archos™ 504 DVRs (with 80 GB hard drives) and Archos 700 DVRs (with 100 GB hard drives). All cameras were equipped with 12 infrared light-emitting diodes (LEDs) for night vision. We manually annotated the dataset with temporal events, time-of-day/lighting conditions, and spatial bounding boxes, without relying on any object detection/tracking algorithms. The temporal annotations were initially created by experts who study the behavior of the Kagu and were later refined to improve the precision of the temporal boundaries. Additional labels, such as lighting conditions, were added during the refinement process. The spatial bounding box annotations of 23M frames were created manually using professional video editing software (DaVinci Resolve). We attempted to use available data annotation tools, but they did not work at the scale of our video (10 days of continuous monitoring), so we resorted to video editing software, which allowed us to annotate and export bounding box masks as videos. The masks were then post-processed to convert the annotations from binary mask frames to bounding box coordinates for storage. It is worth noting that the video editing software allowed us to linearly interpolate between keyframes of the bounding box annotations, which saved time and effort when the bird's motion was linear.
Both temporal and spatial annotations were verified by two volunteer graduate students. The process of creating spatial and temporal annotations took approximately two months. 
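The post-processing step mentioned above, converting exported binary mask frames into bounding box coordinates, amounts to taking the min/max of the nonzero pixel indices in each frame. A minimal sketch (the function name and the (x_min, y_min, x_max, y_max) convention are our own choices, not the dataset tooling's):

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert one binary mask frame to (x_min, y_min, x_max, y_max)
    in pixel coordinates, or None if the mask is empty (bird absent)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy 6x8 mask with a filled region covering rows 2-4 and columns 3-6
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1
print(mask_to_bbox(mask))  # (3, 2, 6, 4)
```

Applied per frame to the exported mask videos, this yields the compact coordinate representation stored with the dataset.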